Classification Models using Tidymodels

Introduction to machine learning

About the Author👨‍🎓🎓

Bongani Ncube is a Data Scientist with a background in mathematical statistics. He has expertise in statistical modelling, data analysis and data visualisation using the R programming language. He is the founder of Neat Elite Research and Data Analytics, an analytics consultancy that helps students and companies draw insights from their data.

Background

GoalZone is a fitness club chain in Canada. GoalZone offers a range of fitness classes in two capacities. Some classes are always fully booked, yet fully booked classes often have a low attendance rate. GoalZone wants to increase the number of spaces available for classes. They want to do this by predicting whether a member will attend the class or not. If they can predict that a member will not attend, they can make another space available.

Setup

  • Let's load all required packages and look at their versions.
pacman::p_load("tidyverse",
               "tidymodels",
               "magrittr",
               "here", 
               "scales", 
               "glmnet", 
               "stacks", 
               "janitor", 
               "finetune", 
               "vip",
               "data.table",
               "DT",
               "alluvial",
               "extrafont",
               "gt",
               "gtsummary")


lst <- c("tidyverse",
               "tidymodels",
               "magrittr",
               "here", 
               "scales", 
               "glmnet", 
               "stacks", 
               "janitor", 
               "finetune", 
               "vip",
               "data.table",
               "DT",
               "alluvial",
               "extrafont",
               "gt",
               "gtsummary")

as_tibble(installed.packages())  |>
  select(Package, Version)  |> 
  filter(Package %in% lst) |> 
  datatable(style="bootstrap", class="table-condensed",
          options = list(dom = 'tp', scrollX = TRUE))
doParallel::registerDoParallel()
knitr::include_graphics("diction.PNG")

Data preparation

  • booking_id : same as description, with no missing values
  • months_as_member : same as description, with no missing values
  • weight : had 20 missing values, so I replaced them with the mean (during data preprocessing)
  • days_before : some values contained the string “days”, so I removed it to keep numbers only
  • day_of_week : days were repeated under different names, so I collapsed them, for example ‘Fri. and Fri’, ‘Mon and Monday’, then ‘Wed and Wednesday’
  • time : same as description, with no missing values
  • category : replaced an unknown value “-” with “unknown”
  • attended : same as description
# read in the data

df<-readr::read_csv("fitness_class_2212.csv")

datatable(head(df,100),style="bootstrap", class="table-condensed",
          options = list(dom = 'tp', scrollX = TRUE))

Data processing and preparation

df |> 
  tabyl(day_of_week)
#>  day_of_week   n     percent
#>          Fri 279 0.186000000
#>         Fri.  26 0.017333333
#>          Mon 218 0.145333333
#>       Monday  10 0.006666667
#>          Sat 202 0.134666667
#>          Sun 213 0.142000000
#>          Thu 241 0.160666667
#>          Tue 195 0.130000000
#>          Wed  81 0.054000000
#>    Wednesday  35 0.023333333

df |> 
  tabyl(category)
#>  category   n     percent
#>         -  13 0.008666667
#>      Aqua  76 0.050666667
#>   Cycling 376 0.250666667
#>      HIIT 667 0.444666667
#>  Strength 233 0.155333333
#>      Yoga 135 0.090000000
  • Firstly, the data looks a bit messy.
  • The day_of_week column should have 7 levels, representing the 7 days of the week.
  • ‘Wed’ and ‘Wednesday’ are the same thing, so we collapse them into one level.
  • ‘Fri.’ and ‘Fri’ are one thing, as are ‘Mon’ and ‘Monday’.
  • We have an unknown level in category, shown by “-”.
  • days_before is supposed to be a numeric column but shows as character because the values contain strings (we need to fix that as well).

Preprocessing and Exploration of data

Data exploration and analysis is typically an iterative process, in which the data scientist takes a sample of data, and performs the following kinds of task to analyze it and test hypotheses:

  • Clean data to handle errors, missing values, and other issues.

  • Apply statistical techniques to better understand the data, and how the sample might be expected to represent the real-world population of data, allowing for random variation.

  • Visualize data to determine relationships between variables, and in the case of a machine learning project, identify features that are potentially predictive of the label.

  • Derive new features from existing ones that might better encapsulate relationships within the data.

  • Revise the hypothesis and repeat the process.

df<-df|> 
  mutate_at(c(5,6,7,8),as.factor) |>   # change these columns to factors
  mutate(day_of_week = 
           fct_collapse(day_of_week,Wed=c("Wed","Wednesday"),
                                  Fri=c("Fri.","Fri"), # collapse redundant levels
                                  Mon=c("Mon","Monday")),
         category=fct_collapse(category,unknown=c("-")), # add category(unknown)
         days_before=readr::parse_number(days_before)) # leave only numbers in the dataset
  • Let's look at the data one more time.
datatable(head(df,100),style="bootstrap", class="table-condensed",
          options = list(dom = 'tp', scrollX = TRUE))

Amazing! Our data is much more tidy-r! 💫

Performing exploratory data analysis

df|>
  mutate(weight = replace_na(weight, mean(weight, na.rm = T)),
         attended = ifelse(attended==1,"yes","no"))|> 
  select(attended,time,day_of_week,category,weight,months_as_member) |> 
  tbl_summary(
    by = attended,
    statistic = list(
      all_continuous() ~ "{mean} ({sd})",
      all_categorical() ~ "{n} / {N} ({p}%)"), 
    label = day_of_week ~ "day of week")|> 
  add_p(test = all_continuous() ~ "t.test",
        pvalue_fun = function(x) style_pvalue(x, digits = 2))|> 
   modify_header(statistic ~ "**Test Statistic**")|>
  bold_labels()|> 
  modify_fmt_fun(statistic ~ style_sigfig) 
Characteristic        no, N = 1,046¹      yes, N = 454¹     Test Statistic   p-value²
time                                                              3.7          0.054
    AM                781 / 1,046 (75%)   360 / 454 (79%)
    PM                265 / 1,046 (25%)   94 / 454 (21%)
day of week                                                       7.0          0.33
    Fri               211 / 1,046 (20%)   94 / 454 (21%)
    Mon               163 / 1,046 (16%)   65 / 454 (14%)
    Sat               139 / 1,046 (13%)   63 / 454 (14%)
    Sun               142 / 1,046 (14%)   71 / 454 (16%)
    Thu               163 / 1,046 (16%)   78 / 454 (17%)
    Tue               136 / 1,046 (13%)   59 / 454 (13%)
    Wed               92 / 1,046 (8.8%)   24 / 454 (5.3%)
category                                                                       0.55
    unknown           11 / 1,046 (1.1%)   2 / 454 (0.4%)
    Aqua              51 / 1,046 (4.9%)   25 / 454 (5.5%)
    Cycling           266 / 1,046 (25%)   110 / 454 (24%)
    HIIT              454 / 1,046 (43%)   213 / 454 (47%)
    Strength          171 / 1,046 (16%)   62 / 454 (14%)
    Yoga              93 / 1,046 (8.9%)   42 / 454 (9.3%)
weight                85 (13)             77 (10)                 12           <0.001
months_as_member      11 (7)              25 (17)                 -16          <0.001
¹ n / N (%); Mean (SD)
² Pearson's Chi-squared test; Fisher's exact test; Welch Two Sample t-test
  • Though the p-value is slightly above 5%, there is some association between time of day and attendance.
  • There is a significant difference in mean weight and mean months as a member between those who attended and those who did not.
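
The Welch two-sample t-test reported in the table can also be run directly on the cleaned data; a minimal sketch, assuming df has already been processed as above (t.test simply drops the rows where weight is missing):

```r
# Welch two-sample t-test of weight by attendance status
# (var.equal = FALSE is the default, matching the Welch test in the table)
t.test(weight ~ attended, data = df)
```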
df|>
  mutate(attended = ifelse(attended==1,"yes","no"))|>
  group_by(attended,category,time) |> 
  summarize(count=n()) |> 
  mutate(prop=count/sum(count)) |> 
  ggplot(aes(x = category, y = count,fill=attended))+
  geom_bar(stat = "identity")+
  tvthemes::scale_fill_avatar()+
  facet_wrap(~time,scales="free")+
  tvthemes::theme_theLastAirbender(title.font="Slayer",
                                   text.font = "Slayer")+
  theme(legend.position = 'right',
        axis.text.x = element_text(face="bold", 
                                   color="brown",
                                   size=8, 
                                   angle=45))

  • It's quite clear that most bookings were for the morning (AM) sessions.
  • Most people registered for the HIIT and Cycling classes.

Statistics for weight grouped by attendance status

df|>
  mutate(attended = ifelse(attended==1,"yes","no"))|> 
   group_by(attended)|> 
   summarize(count=n(),
             mean=mean(weight, na.rm = T),
             std_dev=sd(weight, na.rm = T),
             median=median(weight, na.rm = T), 
             min=min(weight, na.rm = T),
             max=max(weight, na.rm = T))|> 
   column_to_rownames(var = "attended") |> 
  as.matrix()
#>     count     mean  std_dev median   min    max
#> no   1046 85.01258 12.97347  82.89 55.41 170.52
#> yes   454 77.09441 10.35717  74.85 57.83 121.38

# Calculate the overall attendance rate
print(paste('total attendance rate (%) is ',round(454/(454+1046)*100,digits=2)))
#> [1] "total attendance rate (%) is  30.27"

Pointer

  • It could be intuitive to say that the number of months someone has been a member can affect their attendance (is it practical? 😩😅).
  • Let's examine the months_as_member variable.

Measures of central tendency

To understand the distribution better, we can examine so-called measures of central tendency, which is a fancy way of describing statistics that represent the “middle” of the data. The goal is to try to find a “typical” value. Common ways to define the middle of the data include:

  • The mean: A simple average based on adding together all of the values in the sample set, and then dividing the total by the number of samples.

  • The median: The value in the middle of the range of all of the sample values.

  • The mode: The most commonly occurring value in the sample set.

Let’s calculate these values, along with the minimum and maximum values for comparison, and show them on the histogram.

library(statip)


min_val <- min(df$months_as_member)
max_val <- max(df$months_as_member)
mean_val <- mean(df$months_as_member)
med_val <- median(df$months_as_member)
mod_val <- mfv(df$months_as_member)

# Print the stats
glue::glue(
  'Minimum: {format(round(min_val, 2), nsmall = 2)}
   Mean: {format(round(mean_val, 2), nsmall = 2)}
   Median: {format(round(med_val, 2), nsmall = 2)}
   Mode: {format(round(mod_val, 2), nsmall = 2)}
   Maximum: {format(round(max_val, 2), nsmall = 2)}'
)
#> Minimum: 1.00
#> Mean: 15.63
#> Median: 12.00
#> Mode: 8.00
#> Maximum: 148.00


id_02_hist <- df |> 
  ggplot() +
  geom_histogram(aes(x = months_as_member), 
                 fill = "firebrick", alpha = 0.66) +
  labs(title = "months_as_member distribution") +
  theme(plot.title = element_text(hjust = 0.5, size = 14),
        axis.title.x = element_blank(),
        axis.title.y = element_blank(),
        axis.text.y = element_blank(),
        axis.ticks.y = element_blank())+
  ggthemes::scale_fill_tableau()+
  tvthemes::theme_theLastAirbender(title.font="Slayer",text.font = "Slayer")+
  geom_vline(xintercept = min_val, color = 'gray33', linetype = "dashed", size = 1.3)+
  geom_vline(xintercept = mean_val, color = 'cyan', linetype = "dashed", size = 1.3)+
  geom_vline(xintercept = med_val, color = 'red', linetype = "dashed", size = 1.3 )+
  geom_vline(xintercept = mod_val, color = 'yellow', linetype = "dashed", size = 1.3 )+
  geom_vline(xintercept = max_val, color = 'gray33', linetype = "dashed", size = 1.3 )


id_02_log_hist <- df |> 
  ggplot() +
  geom_histogram(aes(x = months_as_member), 
                 fill = "firebrick", alpha = 0.66) +
  labs(title = "months_as_member log distribution") +
  scale_x_log10() +
  theme(plot.title = element_text(hjust = 0.5, size = 14),
        axis.title.x = element_blank(),
        axis.title.y = element_blank(),
        axis.text.y = element_blank(),
        axis.ticks.y = element_blank())+
  ggthemes::scale_fill_tableau()+
  tvthemes::theme_theLastAirbender(title.font="Slayer",text.font = "Slayer")

gridExtra::grid.arrange(id_02_hist, id_02_log_hist)

  • The first plot shows the data without a log transformation; it looks like we have an outlier (we will deal with it later).
  • The maximum value is way above the rest (this person has presumably been a member of the fitness club for a very long time).
  • The second plot shows the data after log-transforming.

Boxplots

library(patchwork)
# Plot a box plot
p1<-df |> 
  ggplot(aes(x = 1, y = months_as_member))+
  geom_boxplot(fill=tvthemes::avatar_pal()(1),alpha = 0.7)+
  tvthemes::scale_fill_avatar()+
  tvthemes::theme_avatar()+
  ggtitle("Data Distribution \n with an outlier")+
  xlab("")+
  ylab("months as a member")+
  theme(plot.title = element_text(hjust = 0.5))

p2<- df |> 
  filter(months_as_member<125) |> 
  ggplot(aes(x = 1, y = months_as_member))+
  geom_boxplot(fill=tvthemes::rickAndMorty_pal()(1),alpha = 0.7)+
  tvthemes::scale_fill_avatar()+
  tvthemes::theme_avatar()+
  ggtitle("Data Distribution \n without an outlier")+
  xlab("")+
  ylab("months as a member")+
  theme(plot.title = element_text(hjust = 0.5))

p3<- df |> 
  filter(months_as_member<100) |> 
  ggplot(aes(x = 1, y = months_as_member))+
  geom_boxplot(fill=tvthemes::hilda_pal()(1),alpha = 0.7)+
  tvthemes::scale_fill_avatar()+
  tvthemes::theme_avatar()+
  ggtitle("Data Distribution \n without many outliers")+
  xlab("")+
  ylab("months as a member")+
  theme(plot.title = element_text(hjust = 0.5))

p1+p2+p3

  • As seen before, there is indeed an outlier here.
  • We have tried to remove it in the second and third boxplots.
subset <- df |>
  dplyr::mutate(days_before=as.numeric(days_before),
                attended = ifelse(attended==1,"yes","no")) |> 
  dplyr::select(days_before,months_as_member,weight,attended)


# Bring in external file for visualisations
source('functions/visualisations.R')

# Use plot function
plot <- histoplotter(subset, attended, 
                     chart_x_axis_lbl = "attendance status", 
                     chart_y_axis_lbl = 'Measures',
                     boxplot_color = 'navy', 
                     boxplot_fill = '#89CFF0', 
                     box_fill_transparency = 0.2) 

# Add extras to plot
plot + 
  tvthemes::theme_theLastAirbender() +
  tvthemes::scale_color_attackOnTitan()+
  theme(legend.position = 'top') 

  • We have an outlier, as seen in the previous histogram.

Alluvial Diagram


tbl_summary <- df |> 
  dplyr::mutate(days_before=as.numeric(days_before),
                attended = ifelse(attended==1,"yes","no")) |>
  group_by(attended, day_of_week, category) |>
  summarise(N = n()) |>
  ungroup() |>
  na.omit()


alluvial(tbl_summary[, c(1:4)],
         freq=tbl_summary$N, border=NA,
         col=ifelse(tbl_summary$attended == "yes", "blue", "gray"),
         cex=0.65,
         ordering = list(
           order(tbl_summary$attended, tbl_summary$day_of_week=="Wed"),
           order(tbl_summary$category, tbl_summary$day_of_week=="Wed"),
           NULL,
           NULL))

Are the classes balanced?

loadfonts(quiet=TRUE)

iv_rates <- df |>
  mutate(attended=ifelse(attended==1,"yes","no"),
         attended=as.factor(attended)) |> 
  group_by(attended) |>
  summarize(count = n()) |> 
  mutate(prop = count/sum(count)) |>
  ungroup() 

plot<-iv_rates |>
  ggplot(aes(x=attended, y=prop, fill=attended)) + 
  geom_col(color="black",width = 0.5)+ 
  theme(legend.position="bottom") + 
  geom_label(aes(label=scales::percent(prop)), color="white") + 
  labs(
    title = "attendance ratio",
    subtitle = "Fitness analysis",
    y = "proportion(%)", 
    x = "attendance",
    fill="attendance",
    caption="B.Ncube::Data enthusiast") + 
  scale_y_continuous(labels = scales::percent)+ 
  tvthemes::scale_fill_kimPossible()+
  tvthemes::theme_theLastAirbender(title.font="Slayer",
                                   text.font = "Slayer")+
  theme(legend.position = 'right')
plot

Against my better judgement, I am moving straight to modelling!

Data Budgeting

In this section, we allocate specific subsets of data for different tasks.

set.seed(2056)
# Load data
raw <- df
# Convert 0s and 1s (outcome) into a factor and split data
bb_split <- raw |> 
  select(-booking_id) |> 
  mutate(attended = if_else(attended==1, "yes", "no"),
         attended = factor(attended)) |> 
  initial_split(prop = 0.75)


bb_train <- training(bb_split)
bb_test <- testing(bb_split)

Create a model recipe

train_rcp <- recipes::recipe(attended ~ ., data=bb_train)|> 
  step_impute_mean(weight) |> 
  step_scale(all_numeric_predictors()) |> 
  step_dummy(all_nominal_predictors()) |> 
  themis::step_smote(attended, over_ratio = 0.97, neighbors = 3)
 

# Prep and juice the recipe so we can view this as a separate data frame
training_df <- train_rcp|> 
  prep()|> 
  juice()

# Class imbalance resolved
after_smote <- training_df %$%
  table(attended) |> 
  prop.table() |> 
  unclass() 

print(after_smote)
#> attended
#>        no       yes 
#> 0.5077419 0.4922581
  • We applied Synthetic Minority Oversampling (SMOTE), a nearest-neighbours oversampling method; checking what happened to the binary labels (attended: yes or no) shows the classes are now roughly balanced.
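
For comparison, the class balance before oversampling can be checked the same way on the raw training split; a quick sketch using bb_train from the data budgeting step above:

```r
# class proportions in the training data before SMOTE, for comparison
bb_train |>
  dplyr::count(attended) |>
  dplyr::mutate(prop = n / sum(n))
```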

Training the Logistic Regression baseline model

The reasoning is simple: if a plain linear classifier does a better job than a more complex algorithm, we should stick with good old logistic regression.

lr_mod <- parsnip::logistic_reg()|> 
   set_engine('glm') |> 
  set_mode("classification")  
 

Creating the model workflow

lr_wf <-
  workflow()|> 
  add_model(lr_mod)|> 
  add_recipe(train_rcp)

These are easy to explain:

  1. I create a workflow for the model
  2. I add the model that I have initialized in the preceding step
  3. I add the recipe previously created

Next, I will kick off the training process:

lr_fit <- 
  lr_wf|> 
  fit(data=bb_train)

Extracting the fitted data

I want to pull the fitted coefficients into a tibble I can explore. This can be done below:

options(scipen=999)

lr_fitted <- lr_fit|> 
  extract_fit_parsnip()|> 
  tidy()
lr_fitted
#> # A tibble: 16 x 5
#>    term              estimate std.error statistic  p.value
#>    <chr>                <dbl>     <dbl>     <dbl>    <dbl>
#>  1 (Intercept)       -14.6     457.       -0.0321 9.74e- 1
#>  2 months_as_member    1.63      0.119    13.7    8.84e-43
#>  3 weight             -0.139     0.0771   -1.80   7.20e- 2
#>  4 days_before        -0.669     0.424    -1.58   1.15e- 1
#>  5 day_of_week_Mon    -1.07      0.845    -1.27   2.06e- 1
#>  6 day_of_week_Sat     0.477     0.299     1.59   1.12e- 1
#>  7 day_of_week_Sun     1.30      0.467     2.79   5.29e- 3
#>  8 day_of_week_Thu     0.0967    0.295     0.327  7.43e- 1
#>  9 day_of_week_Tue    -0.414     0.659    -0.627  5.30e- 1
#> 10 day_of_week_Wed    -0.947     0.504    -1.88   6.04e- 2
#> 11 time_PM            -0.380     0.164    -2.32   2.06e- 2
#> 12 category_Aqua      15.1     457.        0.0331 9.74e- 1
#> 13 category_Cycling   14.8     457.        0.0324 9.74e- 1
#> 14 category_HIIT      15.0     457.        0.0329 9.74e- 1
#> 15 category_Strength  14.6     457.        0.0320 9.74e- 1
#> 16 category_Yoga      14.7     457.        0.0321 9.74e- 1

I will visualise this via a bar chart to observe my significant features:


lr_fitted_add <- lr_fitted |> 
  mutate(Significance = ifelse(p.value < 0.05, 
                               "Significant", "Insignificant"))|> 
  arrange(desc(p.value)) 
#Create a ggplot object to visualise significance
plot <- lr_fitted_add|> 
  ggplot(mapping = aes(x=term, y=p.value, fill=Significance)) +
  geom_col() + 
  ggthemes::scale_fill_tableau() +
  theme(axis.text.x = element_text(face="bold", 
                                   color="#0070BA",
                                   size=8, 
                                   angle=90)) +
  labs(y="P value", 
       x="Terms",
       title="P value significance chart",
       subtitle="significant variables in the model",
       caption="Produced by Bongani Ncube")

plotly::ggplotly(plot) 
  • Only months_as_member, day_of_week: Sun and time: PM are significant at the 5% level.

Set up model (xgboost model)


xgboost_mod <- boost_tree(trees=tune(), 
                          tree_depth = tune())|> 
  set_mode('classification')|> 
  set_engine('xgboost')

Hyperparameter tuning and K-Fold Cross Validation

Here, as stated, I will do a grid search for the best hyperparameters to pass to my model:

# Set the selected parameters in the grid
boost_grid <- dials::grid_regular(
  trees(), 
  tree_depth(), 
  levels=5) # number of levels per parameter (5 x 5 = 25 combinations)

# Create the resampling method i.e. K-fold Cross Validation
folds <- vfold_cv(bb_train, v = 5)

Create XGBoost workflow

I will now implement the workflow to manage the XGBoost model:

xgboost_wf <- workflow()|>
  add_model(xgboost_mod)|> 
  add_recipe(train_rcp)

Once I have this I can then go about iterating through the best combinations of fold and hyperparameter:
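
The tuning call that produces xgboost_fold is not shown in the original; a minimal sketch, assuming tune_grid's default classification metrics (accuracy and roc_auc):

```r
# Iterate over every grid combination, resampled across the 5 folds
xgboost_fold <- tune_grid(
  xgboost_wf,
  resamples = folds,
  grid = boost_grid
)
```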

We will now select the best model:

best_model <- xgboost_fold|> 
  #select_best('accuracy')
  select_best('roc_auc')

Visualising the results:

xgboost_fold|> 
  collect_metrics()|> 
  mutate(tree_depth = factor(tree_depth))|> 
  ggplot(aes(trees, mean, color = tree_depth)) +
  geom_line(size=1.5, alpha=0.6) +
  geom_point(size=2) +
  facet_wrap(~ .metric, scales='free', nrow=2) +
  scale_x_log10(labels = scales::label_number()) +
  scale_color_viridis_d(option='plasma', begin=.9, end =0) + 
  ggthemes::scale_color_tableau()+
  ggthemes::theme_fivethirtyeight()

Finalise the workflow and fit best model

I will now finalise my workflow by selecting the best hyperparameters for the job:

final_wf <- 
  xgboost_wf|> 
  finalize_workflow(best_model)

print(final_wf)
#> == Workflow ====================================================================
#> Preprocessor: Recipe
#> Model: boost_tree()
#> 
#> -- Preprocessor ----------------------------------------------------------------
#> 4 Recipe Steps
#> 
#> * step_impute_mean()
#> * step_scale()
#> * step_dummy()
#> * step_smote()
#> 
#> -- Model -----------------------------------------------------------------------
#> Boosted Tree Model Specification (classification)
#> 
#> Main Arguments:
#>   trees = 1
#>   tree_depth = 4
#> 
#> Computational engine: xgboost

# Final fit of our fold and hyperparameter combination

final_xgboost_fit <- 
  final_wf|> 
  last_fit(bb_split)

Collect metrics for evaluation

The final step would be to collect the metrics for evaluation. We will dedicate a separate section to the evaluation of our models:

final_xgboost_fit|> 
  collect_metrics()
#> # A tibble: 2 x 4
#>   .metric  .estimator .estimate .config             
#>   <chr>    <chr>          <dbl> <chr>               
#> 1 accuracy binary         0.797 Preprocessor1_Model1
#> 2 roc_auc  binary         0.849 Preprocessor1_Model1

Next we will look at the workflow fit:


# This extracts the workflow fit
workflow_xgboost_fit <- final_xgboost_fit|> 
  extract_workflow()

# This extracts the parsnip model
xgboost_model_fit <- final_xgboost_fit|> 
  extract_fit_parsnip() 

xgboost_model_fit|> 
  vip(geom = "point")

Model evaluation

As we are following on from the xgboost model building, we will evaluate this first and then compare to our baseline model:

Use fitted XGBoost model to predict on testing set

The aim here is to check that our predictions match up with our ground-truth labels. By default, class labels are assigned when the predicted probability exceeds 0.5; however, we are going to tweak this threshold so that a "yes" label is only assigned when the probability is greater than 0.7:

# Pass our test data through model
testing_fit_class <- predict(workflow_xgboost_fit, bb_test)
testing_fit_probs <- predict(workflow_xgboost_fit, bb_test, type='prob')
# Bind this on to our test data with the label to compare ground truth vs predicted
predictions<- cbind(bb_test,testing_fit_probs, testing_fit_class)|>
  dplyr::mutate(xgboost_model_pred=.pred_class,
                xgboost_model_prob=.pred_yes)|> 
  dplyr::select(everything(), -c(.pred_class, .pred_no))|> 
  dplyr::mutate(xgboost_class_custom = ifelse(xgboost_model_prob >0.7,"yes","no"))|> 
  dplyr::select(-.pred_yes)

Use fitted logistic regression model to predict on test set

We are now going to take the predictions from the baseline model and append them to the predictions data frame we already have:

testing_lr_fit_probs <- predict(lr_fit, bb_test, type='prob')
testing_lr_fit_class <- predict(lr_fit, bb_test)

predictions<- cbind(predictions,
                    testing_lr_fit_probs, 
                    testing_lr_fit_class)

predictions <- predictions|> 
  dplyr::mutate(log_reg_model_pred=.pred_class,
                log_reg_model_prob=.pred_yes)|> 
  dplyr::select(everything(), -c(.pred_class, .pred_no))|> 
  dplyr::mutate(log_reg_class_custom = ifelse(log_reg_model_prob >0.7,"yes","no"))|> 
  dplyr::select(-.pred_yes)


# Get a head view of the finalised data
head(predictions)
#>   months_as_member weight days_before day_of_week time category attended
#> 1               17  79.56           8         Wed   PM Strength       no
#> 2               10  79.01           2         Mon   AM     HIIT       no
#> 3                5  86.12          10         Fri   AM  Cycling       no
#> 4               15  69.29           8         Thu   AM     HIIT       no
#> 5               13  73.22           4         Tue   AM  Cycling       no
#> 6               33  81.55           8         Thu   AM  Cycling      yes
#>   xgboost_model_pred xgboost_model_prob xgboost_class_custom log_reg_model_pred
#> 1                yes          0.5407097                   no                 no
#> 2                 no          0.4563613                   no                 no
#> 3                 no          0.3761200                   no                 no
#> 4                 no          0.4794471                   no                yes
#> 5                 no          0.4794471                   no                 no
#> 6                yes          0.6074300                   no                yes
#>   log_reg_model_prob log_reg_class_custom
#> 1          0.2158120                   no
#> 2          0.3621105                   no
#> 3          0.1448192                   no
#> 4          0.5957813                   no
#> 5          0.4900168                   no
#> 6          0.9115647                  yes

Evaluating the models with the ConfusionTableR package

The default caret confusionMatrix function saves everything as a string and doesn't allow you to work with the values in the output.

This is the problem the ConfusionTableR package solves: it lets you easily store the individual metrics as variables, as and when needed.

Evaluate Logistic Regression baseline

First, I will evaluate my baseline model using the package:

cm_lr <- ConfusionTableR::binary_class_cm(
  train_labels = as.factor(predictions$log_reg_class_custom),
  truth_labels = as.factor(predictions$attended),
  positive='yes', mode='everything'
  )
cm_lr$confusion_matrix
#> Confusion Matrix and Statistics
#> 
#>           Reference
#> Prediction  no yes
#>        no  240  59
#>        yes  19  57
#>                                          
#>                Accuracy : 0.792          
#>                  95% CI : (0.7474, 0.832)
#>     No Information Rate : 0.6907         
#>     P-Value [Acc > NIR] : 0.00000724     
#>                                          
#>                   Kappa : 0.462          
#>                                          
#>  Mcnemar's Test P-Value : 0.00001006     
#>                                          
#>             Sensitivity : 0.4914         
#>             Specificity : 0.9266         
#>          Pos Pred Value : 0.7500         
#>          Neg Pred Value : 0.8027         
#>               Precision : 0.7500         
#>                  Recall : 0.4914         
#>                      F1 : 0.5937         
#>              Prevalence : 0.3093         
#>          Detection Rate : 0.1520         
#>    Detection Prevalence : 0.2027         
#>       Balanced Accuracy : 0.7090         
#>                                          
#>        'Positive' Class : yes            
#> 

The baseline model performs pretty well. You can see this is the result of fixing our class imbalance. Let's work with it in a row-wise fashion, as we can extract some metrics we may be interested in:

# Get record level confusion matrix for logistic regression model
cm_rl_log_reg <- cm_lr$record_level_cm
accuracy_frame <- tibble(
  Accuracy=cm_rl_log_reg$Accuracy,
  Kappa=cm_rl_log_reg$Kappa,
  Precision=cm_rl_log_reg$Precision,
  Recall=cm_rl_log_reg$Recall
)

Evaluate XGBoost Model

The next stage is to evaluate the XGBoost model. We will use this final evaluation to compare with our baseline model. Note: in reality this would be compared across many models:

cm_xgb <- ConfusionTableR::binary_class_cm(
  #Here you will have to cast to factor type as the tool expects factors
  train_labels = as.factor(predictions$xgboost_class_custom),
  truth_labels = as.factor(predictions$attended),
  positive='yes', mode='everything'
  )

# View the confusion matrix native
cm_xgb$confusion_matrix
#> Confusion Matrix and Statistics
#> 
#>           Reference
#> Prediction  no yes
#>        no  259 116
#>        yes   0   0
#>                                              
#>                Accuracy : 0.6907             
#>                  95% CI : (0.6412, 0.7371)   
#>     No Information Rate : 0.6907             
#>     P-Value [Acc > NIR] : 0.5251             
#>                                              
#>                   Kappa : 0                  
#>                                              
#>  Mcnemar's Test P-Value : <0.0000000000000002
#>                                              
#>             Sensitivity : 0.0000             
#>             Specificity : 1.0000             
#>          Pos Pred Value :    NaN             
#>          Neg Pred Value : 0.6907             
#>               Precision :     NA             
#>                  Recall : 0.0000             
#>                      F1 :     NA             
#>              Prevalence : 0.3093             
#>          Detection Rate : 0.0000             
#>    Detection Prevalence : 0.0000             
#>       Balanced Accuracy : 0.5000             
#>                                              
#>        'Positive' Class : yes                
#> 
  • Now extract the record-level metrics and bind them onto the previous accuracy frame to view the difference between the two models:
# Get record level confusion matrix for the XGBoost model
cm_rl_xgboost <- cm_xgb$record_level_cm

accuracy_frame_xg <- tibble(
  Accuracy=cm_rl_xgboost$Accuracy,
  Kappa=cm_rl_xgboost$Kappa,
  Precision=cm_rl_xgboost$Precision,
  Recall=cm_rl_xgboost$Recall
)

# Bind the rows from the previous frame 
accuracy_frame <- rbind(accuracy_frame, accuracy_frame_xg)
accuracy_frame
#> # A tibble: 2 x 4
#>   Accuracy Kappa Precision Recall
#>      <dbl> <dbl>     <dbl>  <dbl>
#> 1    0.792 0.462      0.75  0.491
#> 2    0.691 0         NA     0

Conclusion

  • This article served to show some of the ways to visualise data, as well as how to fit logistic regression and XGBoost models using tidymodels.
  • The logistic regression model performed considerably better than XGBoost.

References

  • Iannone R, Cheng J, Schloerke B, Hughes E, Seo J (2022). gt: Easily Create Presentation-Ready Display Tables. R package version 0.8.0, https://CRAN.R-project.org/package=gt.

  • Kuhn et al., (2020). Tidymodels: a collection of packages for modeling and machine learning using tidyverse principles. https://www.tidymodels.org

  • Wickham H, Averick M, Bryan J, Chang W, McGowan LD, François R, Grolemund G, Hayes A, Henry L, Hester J, Kuhn M, Pedersen TL, Miller E, Bache SM, Müller K, Ooms J, Robinson D, Seidel DP, Spinu V, Takahashi K, Vaughan D, Wilke C, Woo K, Yutani H (2019). “Welcome to the tidyverse.” Journal of Open Source Software, 4(43), 1686. doi:10.21105/joss.01686

Further reading